(54) Interactive video object processing environment having concurrently active subordinate windows



(57) A video processing environment includes a 
user interface and processing shell from which various 
video processing 'plug-in' programs are accessed. The 
shell insulates the plug-ins from the intricacies of read- 
ing various file formats. The user interface allows an 
operator to load a video sequence, define and view one 
or more video objects on any one or more frames of the 
video sequence, edit existing video object segmenta- 
tions, view video objects across a series of video 
frames, and encode video objects among a video
sequence in a desired format. Various encoding parameters
can be adjusted allowing the operator to view the
video sequence encoded at the various parameter set- 
tings. The user interface includes a video window, a 
time-line window, a zoom window, a set of menus 
including a menu of plug-in programs, and a set of dia- 
logue boxes, including encoding parameter dialogue 
boxes. 
Description

CROSS REFERENCE TO RELATED APPLICATIONS 

[0001] This invention is related to commonly-assigned
U.S. Patent Application Serial No. 09/323,501,
filed June 1, 1999 for "Video Object Segmentation
Using Active Contour Modelling With Global Relaxation,"
of Shijun Sun and Yongmin Kim; commonly-assigned
U.S. Patent Application Serial No. __________,
attorney docket no. OT2.P55,
filed on the same day, for "Interactive Video Object
Processing Environment Which Visually Distinguishes
Segmented Video Object," of Christopher Lau et al.;
and commonly-assigned U.S. Patent Application Serial
No. __________, attorney docket no.
OT2.P56, filed on the same day, for "Interactive Video
Object Processing Environment Having Zoom Window,"
of Christopher Lau et al. The contents of all such applications
are incorporated herein by reference and made a
part hereof.

BACKGROUND OF THE INVENTION 

[0002] This invention relates to user interfaces and
interactive processing environments for video editing,
and more particularly to an interactive processing environment
for video object segmentation, tracking and
encoding.

[0003] Graphical user interfaces having windows,
buttons, dialogue boxes and menus are known, such as
those available with the Apple Macintosh Operating
System and the Microsoft Windows-based operating
systems. The inventions disclosed herein relate to a
graphical user interface adapted for video editing tasks,
such as segmentation, tracking and encoding.
[0004] Segmentation is the division of an image into 
semantically meaningful non-overlapping regions. 
When segmenting video, the regions are referred to as 
video objects. Tracking is a method for identifying a
video object across a series of video frames. Encoding 
is the compression and formatting of video according to 
some conventional or proprietary encoding scheme, 
such as the MPEG-4 encoding scheme. 


SUMMARY OF THE INVENTION 

[0005] According to the invention, a processing 
environment for video processing includes a user interface
and processing shell from which various video
processing 'plug-in' programs are executed. The user 
interface allows an operator to load a video sequence, 
define and view one or more video objects on any one 
or more of the frames of the video sequence, edit exist- 
ing video object segmentations, view video objects 
across a series of video frames, and encode video 
objects among a video sequence in a desired format 
(e.g., MPEG-4 encoding). Various encoding parameters 
can be adjusted allowing the operator to view the video
sequence encoded at the various parameter settings. 
One of the advantages of the processing environment is 
that an operator is able to do automatic segmentations 
across a sequence of video frames, rather than time 
consuming manual segmentations for each video 
frame. 

[0006] According to one aspect of the invention, the 
user interface includes a main window from which sub- 
ordinate windows are selectively displayed. Among the 
selectable subordinate windows are a video window, a 
time-line window, a zoom window, and an encoding win- 
dow. The user interface also includes a set of menus 
including a menu of plug-in programs, and a set of dia- 
logue boxes, including encoding parameter dialogue 
boxes. The video sequence is viewed and played in the 
video window using VCR-like controls. Video frames 
may be viewed in sequence or out of sequence (e.g., full 
motion video, stepping, or skipping around). The time- 
line window allows the operator to determine where 
within the sequence the current video frame is located. 
[0007] According to another aspect of the invention, 
an operator may define an object by selecting a com- 
mand button from the time-line window. The operator 
clicks on points in the video window to outline the por- 
tion of the displayed image which is to be the desired 
video object. 

[0008] According to another aspect of this inven- 
tion, the zoom window is concurrently active with the 
video window, while the operator defines the object. In 
particular, the pointing device cursor location is tracked 
concurrently in both the video window and the zoom 
window. Scrolling in the zoom window is automatic to 
track the pointing device cursor. One advantage of this 
is that the operator is able to view a location within the 
video frame, while also viewing a close-up of such location
in the zoom window. This allows the operator to precisely
place a point on a semantically-correct border of
the object (e.g., at the border of an object being 
depicted in video). In some embodiments the zoom win- 
dow shows a close-up of the pixels of the video window 
in the vicinity of the pointing device cursor. 
[0009] According to another aspect of this inven- 
tion, a segmentation plug-in program processes the 
video frame and selected outline to refine the object 
along semantic border lines of the object being
depicted. The result is a video object. 
[0010] According to another aspect of the invention, 
a defined video object is highlighted by one or more of 
the following schemes: overlaying a translucent mask 
which adds a user-selectable color shade to the object; 
outlining the object; viewing the rest of the frame in 
black and white, while viewing the object in color; altering
the background to view the object alone against a
solid (e.g., white, black, gray) background; applying one
filter to the object and another filter to the background. 
[0011] According to another aspect of the invention, 
an operator is able to select timepoints in the time-line 
window and a tracking algorithm from a plug-ins menu.
The tracking algorithm identifies/extracts the defined 
object across a sequence of video frames. Thus, the 
operator is able to view the video sequence with high- 
lighted object from a selected starting time point to a 
selected end time point. Alternatively, the operator may 
view just the video object (without the remaining portion 
of the video frames) from such selected starting to end- 
ing time points. 

[0012] According to another aspect of the invention, 
the operator may step through the video frames from 
starting time point onward. The operator may stop or 
pause the stepping to adjust or redefine the video 
objects. An advantage of this aspect is that as the track- 
ing algorithm begins to lose the ability to accurately 
track an object, the object can be redefined. For exam- 
ple, as some of the background begins to be included 
as part of the video object during tracking over multiple 
frames, the boundaries of the video object can be rede- 
fined. Further, the object can be redefined into one or 
more sub-objects, with each sub-object being tracked 
and displayed from frame to frame. An advantage of the 
plug-in interface is that common or different segmentation
plug-ins may be used to segment different
objects. For example, one segmentation plug-in may be 
well adapted for segmenting objects in the presence of 
affine motion, while another segmentation plug-in is bet- 
ter where the object deforms. Each segmentation plug- 
in may be applied to an object for which it is most effec- 
tive. 

[0013] According to another aspect of the invention, 
the time-line window indicates which frames of a 
sequence have been processed to track/extract a video 
object. 

[0014] According to another aspect of the invention, 
where sub-objects are being tracked the objects can be 
combined into a single object just before video encod- 
ing. The operator is able to select among a variety of 
encoding parameters, such as encoding bit rate, motion 
vector search range, and fidelity of the encoded shape. 
[0015] According to another aspect of the invention,
an encoding status of each object is displayed showing 
the peak signal to noise ratio for each color component 
of each frame and for the total number of bits encoded 
for each frame. An advantage of such display is that the 
operator is able to visualize how peak signal to noise 
ratio varies between video objects over a sequence of 
frames or how the total number of bits affects the peak 
signal to noise ratio of each color component of an 
object. When the image quality is unsatisfactory, these 
displays enable the operator to identify a parameter in 
need of adjusting to balance peak signal to noise ratio 
and the bit rate. For example, an operator is able to 
select a higher number of bits to encode one object and 
a lesser number of bits to encode another object to opti- 
mize image quality for a given number of bits. 
[0016] According to an advantage of this invention,
various processing needs can be met using differing
plug-ins. According to another advantage of the invention,
the processing shell provides isolation between the
user interface and the plug-ins. Plug-ins do not directly
access the video encoder. The plug-ins accomplish
segmentation or tracking or another task by interfacing
through an API (application program interface) module.
For example, a segmentation plug-in defines an object
and stores the pertinent data in a video object manager
portion of the shell. The encoder retrieves the video
objects from the video object manager. Similarly, plug-ins
do not directly draw segmentations on the screen,
but store them in a central location. A graphical user
interface module of the user interface retrieves the data
from the central location and draws the objects in the video
window. As a result, the various plug-ins are insulated
from the intricacies of reading various file formats. Thus,
data can even be captured from a camcorder or downloaded
over a network through the user interface and
shell, without regard for plug-in compatibilities.

[0017] These and other aspects and advantages of
the invention will be better understood by reference to 
the following detailed description taken in conjunction 
with the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS

[0018]

Fig. 1 is a block diagram of an interactive processing
environment for video object segmentation,
tracking and encoding according to an embodiment
of this invention;

Fig. 2 is a block diagram of an exemplary host computing
system for the interactive processing environment
of Fig. 1;

Fig. 3 is a window depiction of a main user interface
window according to an embodiment of this invention;

Fig. 4 is a window depiction of a time-line subordinate
window of the user interface of Fig. 3 according
to one embodiment of this invention;

Fig. 5 is a window depiction of a video object information
subordinate window of the user interface of
Fig. 3 according to one embodiment of this invention;

Fig. 6 is a window depiction of an encoder progress
subordinate window of the user interface of Fig. 3
according to one embodiment of this invention;

Fig. 7 is a flow chart for an exemplary processing
scenario according to an embodiment of this invention;

Fig. 8 is a window depiction of a subordinate video
window of the user interface of Fig. 3 according to
one embodiment of this invention;

Fig. 9 is a window depiction of a subordinate zoom
window of the user interface of Fig. 3 according to
one embodiment of this invention; and

Fig. 10 is a depiction of a portion of a video image
showing a video object as designated by a translucent
mask overlaying the object according to an
embodiment of this invention.

DESCRIPTION OF SPECIFIC EMBODIMENTS
Overview 

[0019] Fig. 1 shows a block diagram of an interactive
processing environment 10 for segmenting, tracking
and encoding video according to one embodiment of
the invention. The processing environment 10 includes
a user interface 12, a shell environment 14 and a plurality
of functional software 'plug-in' programs 16. The user
interface receives and distributes operator inputs from
various input sources, such as a pointing and clicking
device 26 (e.g., mouse, touch pad, track ball), a key
entry device 24 (e.g., a keyboard), or a prerecorded
scripted macro 13. The user interface 12 also controls
formatting outputs to a display device 22. The shell
environment 14 controls interaction between plug-ins 16
and the user interface 12. An input video sequence 11
is input to the shell environment 14. Various plug-in
programs 16a-16n may process all or a portion of the video
sequence 11. One benefit of the shell 14 is to insulate
the plug-in programs from the various formats of potential
video sequence inputs. Each plug-in program
interfaces to the shell through an application program
interface ('API') module 18.

[0020] In one embodiment the interactive processing
environment 10 is implemented on a programmed
digital computer of the type which is well known in the
art, an example of which is shown in Fig. 2. A computer
system 20 has a display 22, a key entry device 24, a
pointing/clicking device 26, a processor 28, and random
access memory (RAM) 30. In addition there commonly
is a communication or network interface 34 (e.g.,
modem; ethernet adapter), a non-volatile storage
device such as a hard disk drive 32 and a transportable
storage media drive 36 which reads transportable storage
media 38. Other miscellaneous storage devices 40,
such as a floppy disk drive, CD-ROM drive, zip drive,
bernoulli drive or other magnetic, optical or other storage
media, may be included. The various components
interface and exchange data and commands through
one or more buses 42. The computer system 20
receives information by entry through the key entry
device 24, pointing/clicking device 26, the network
interface 34 or another input device or input port. The
computer system 20 may be any of the types well known in
the art, such as a mainframe computer, minicomputer,
or microcomputer and may serve as a network server
computer, a networked client computer or a stand alone
computer. The computer system 20 may even be configured
as a workstation, personal computer, or a
reduced-feature network terminal device.
[0021] In another embodiment the interactive
processing environment 10 is implemented in an
embedded system. The embedded system includes
similar digital processing devices and peripherals as the
programmed digital computer described above. In addition,
there are one or more input devices or output
devices for a specific implementation, such as image
capturing.

[0022] In a best mode embodiment software code 
for implementing the user interface 12 and shell envi- 
ronment 14, including computer executable instructions 
and computer readable data are stored on a digital 
processor readable storage media, such as embedded 
memory, RAM, ROM, a hard disk, an optical disk, a 
floppy disk, a magneto-optical disk, an electro-optical 
disk, or another known or to be implemented transport- 
able or non-transportable processor readable storage 
media. Similarly, each one of the plug-ins 16 and the 
corresponding API 18, including digital processor exe- 
cutable instructions and processor readable data are 
stored on a processor readable storage media, such as 
embedded memory, RAM, ROM, a hard disk, an optical 
disk, a floppy disk, a magneto-optical disk, an electro- 
optical disk, or another known or to be implemented 
transportable or non-transportable processor readable
storage media. The plug-ins 16 (with the corresponding 
API 18) may be bundled individually on separate stor- 
age media or together on a common storage medium. 
Further, none, one or more of the plug-ins 16 and the 
corresponding API's 18 may be bundled with the user 
interface 12 and shell environment 14. Further, the var- 
ious software programs and plug-ins may be distributed 
or executed electronically over a network, such as a glo- 
bal computer network. 

[0023] Under various computing models, the soft- 
ware programs making up the processing environment 
10 are installed at an end user computer or accessed
remotely. For stand alone computing models, the exe- 
cutable instructions and data may be loaded into volatile 
or non-volatile memory accessible to the stand alone 
computer. For non-resident computer models, the exe- 
cutable instructions and data may be processed locally 
or at a remote computer with outputs routed to the local 
computer and operator inputs received from the local 
computer. One skilled in the art will appreciate the many 
computing configurations that may be implemented. For 
non-resident computing models, the software programs 
may be stored locally or at a server computer on a pub- 
lic or private, local or wide area network, or even on a 
global computer network. The executable instructions 
may be run either at the end user computer or at the 
server computer with the data being displayed at the 
end user's display device. 

Shell Environment 

[0024] The shell environment 14 allows an operator 
to work in an interactive environment to develop, test or 
use various video processing and enhancement tools. 
In particular, plug-ins for video object segmentation, 
video object tracking and video encoding (e.g., compression)
are supported in a preferred embodiment. Differing
segmentation algorithms can be developed,
tested and implemented as plug-ins 16 for researcher or
end user implementation. Similarly, different tracking 
algorithms and tracking schemes can be implemented 
as plug-ins 16 to track and extract video object data 
from a sequence of video frames. The interactive envi- 
ronment 10 with the shell 14 provides a useful environ- 
ment for creating video content, such as MPEG-4 video 
content or content for another video format. A pull-down 
menu or a pop up window is implemented allowing an 
operator to select a plug-in to process one or more 
video frames. 

[0025] According to a preferred embodiment the 
shell 14 includes a video object manager. A plug-in program
16, such as a segmentation program, accesses a
frame of video data, along with a set of user inputs 
through the shell environment 14. A segmentation plug- 
in program identifies a video object within a video frame. 
The video object data is routed to the shell 14 which 
stores the data within the video object manager module. 
Such video object data then can be accessed by the 
same or another plug-in 16, such as a tracking program.
The tracking program identifies the video object in sub- 
sequent video frames. Data identifying the video object 
in each frame is routed to the video object manager 
module. In effect video object data is extracted for each 
video frame in which the video object is tracked. When 
an operator completes all video object extraction, edit- 
ing or filtering of a video sequence, an encoder plug-in 
16 may be activated to encode the finalized video 
sequence into a desired format. Using such a plug-in 
architecture, the segmentation and tracking plug-ins do 
not need to interface to the encoder plug-in. Further, 
such plug-ins do not need to support reading of several 
video file formats or create video output formats. The 
shell handles video input compatibility issues, while the 
user interface handles display formatting issues. The 
encoder plug-in handles creating a run-time video 
sequence. 
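The data flow described above, in which a segmentation plug-in stores an object and a tracking plug-in later retrieves it, can be pictured with a minimal Python sketch. The class name `VideoObjectManager`, its methods, and the `placeholder_tracker` function are hypothetical names chosen for illustration; the source does not specify the manager's interface, and the constant-shift "tracker" is only a stand-in for a real tracking algorithm.

```python
class VideoObjectManager:
    """Hypothetical central store: object name -> {frame index: polygon}."""

    def __init__(self):
        self._objects = {}

    def store(self, name, frame, polygon):
        # Copy the points so later edits by a plug-in do not alter stored data.
        self._objects.setdefault(name, {})[frame] = list(polygon)

    def retrieve(self, name, frame):
        # Returns None when the object has no data for that frame.
        return self._objects.get(name, {}).get(frame)

    def frames_for(self, name):
        # Frame indices for which this object has been extracted.
        return sorted(self._objects.get(name, {}))


def placeholder_tracker(manager, name, start, end, dx=1, dy=0):
    """Stand-in for a tracking plug-in: shifts the previous frame's polygon.

    A real tracker would analyse pixel data; the constant shift is only an
    assumption used to show the data flow through the manager.
    """
    for frame in range(start + 1, end + 1):
        previous = manager.retrieve(name, frame - 1)
        if previous is None:
            break  # nothing to track from
        manager.store(name, frame, [(x + dx, y + dy) for x, y in previous])


manager = VideoObjectManager()
manager.store("A", 0, [(10, 10), (20, 10), (20, 20)])  # segmentation result
placeholder_tracker(manager, "A", start=0, end=3)
```

In this sketch neither function calls the other directly; both touch only the manager, which mirrors how the plug-ins avoid interfacing with one another or with the encoder.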

[0026] For a Microsoft Windows operating system 
environment, the plug-ins 16 are compiled as dynamic 
link libraries. At processing environment 10 run time, 
the shell 14 scans a predefined directory for plug-in programs.
When present, a plug-in program name is added
to a list which is displayed in a window or menu for user 
selection. When an operator selects to run a plug-in 16, 
the corresponding dynamic link library is loaded into 
memory and a processor begins executing instructions 
from one of a set of pre-defined entry points for the plug- 
in. To access a video sequence and video object seg- 
mentations, a plug-in uses a set of callback functions. A 
plug-in interfaces to the shell program 14 through a cor- 
responding application program interface module 18. 
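The directory scan that builds the plug-in list can be sketched as follows. This is an illustration only: the function name `discover_plugins` and the `.dll` suffix filter are assumptions (the source says only that plug-ins are compiled as dynamic link libraries and found in a predefined directory), and the sketch stops at listing names rather than loading a library or calling its entry points.

```python
import tempfile
from pathlib import Path

PLUGIN_SUFFIX = ".dll"  # assumption: plug-ins are identified by this suffix


def discover_plugins(plugin_dir):
    """Return the plug-in names found in plugin_dir, sorted for menu display."""
    directory = Path(plugin_dir)
    if not directory.is_dir():
        return []  # no plug-in directory means an empty menu
    names = [
        entry.stem
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix.lower() == PLUGIN_SUFFIX
    ]
    return sorted(names, key=str.lower)


# Demonstration with a throwaway directory holding two fake plug-in files.
with tempfile.TemporaryDirectory() as demo_dir:
    for filename in ("tracker.dll", "Segmenter.DLL", "readme.txt"):
        (Path(demo_dir) / filename).touch()
    menu_entries = discover_plugins(demo_dir)
```

Loading the selected library and jumping to one of its entry points would happen only after the operator picks a name from this list, as the paragraph above describes.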
[0027] In addition, there is a segmentation interface 
44 portion of the user interface 12 which is supported by 
a segmentation plug-in. The segmentation interface 44
makes calls to a segmentation plug-in to support
operator selected segmentation commands (e.g., to execute
a segmentation plug-in, configure a segmentation plug-in,
or perform a boundary selection/edit).

[0028] The API's 18 typically allow the corresponding
plug-in to access specific data structures on a linked
need-to-access basis only. For example, an API serves
to fetch a frame of video data, retrieve video object data
from the video object manager, or store video object
data with the video object manager. The separation of
plug-ins and the interfacing through API's allows the
plug-ins to be written in differing programming languages
and under differing programming environments than
those used to create the user interface 12 and shell 14.
In one embodiment the user interface 12 and shell 14
are written in C++. The plug-ins can be written in any
language, such as the C programming language.
[0029] In a preferred embodiment each plug-in 16 is
executed in a separate processing thread. As a result,
the user interface 12 may display a dialog box that plug-ins
can use to display progress, and from which a user
can make a selection to stop or pause the plug-in's
execution.
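One way to picture a plug-in running in its own thread while the user interface remains free to pause or stop it is the following Python sketch. The class `PluginRunner` and its method names are hypothetical, and the per-frame work function is a placeholder; the source does not describe how the pause and stop requests are delivered to the plug-in.

```python
import threading


class PluginRunner:
    """Runs a per-frame work function in a worker thread (illustrative only)."""

    def __init__(self, work_fn, total_frames):
        self._work_fn = work_fn
        self._total = total_frames
        self._stop = threading.Event()
        self._resume = threading.Event()
        self._resume.set()  # start in the "not paused" state
        self.frames_done = 0  # progress value a dialog box could display
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def pause(self):
        self._resume.clear()

    def resume(self):
        self._resume.set()

    def stop(self):
        self._stop.set()
        self._resume.set()  # wake a paused worker so it can exit

    def join(self, timeout=None):
        self._thread.join(timeout)

    def _run(self):
        for frame in range(self._total):
            self._resume.wait()  # blocks here while paused
            if self._stop.is_set():
                break
            self._work_fn(frame)
            self.frames_done = frame + 1


runner = PluginRunner(lambda frame: None, total_frames=5)
runner.start()
runner.join(timeout=5)
```

The worker checks for a stop request once per frame, so in this sketch a stop takes effect at the next frame boundary rather than immediately.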

User-Interface Windows

[0030] Referring to Figs. 1 and 3, the user interface
12 includes the segmentation interface 44 and various
display windows 54-62, dialogue boxes 64, menus 66
and button bars 68, along with supporting software code
for formatting and maintaining such displays. In a preferred
embodiment as shown in Fig. 3, the user interface
is defined by a main window 50 within which a user
selects one or more subordinate windows 52, each of
which may be concurrently active at a given time. The
subordinate windows 52 may be opened or closed,
moved and resized. The main window 50 includes a title
bar 65, a menu bar 66 and a button bar 68. In some
embodiments the various bars 65-68 may be hidden or
viewed at the operator's preference. The main window
also includes a window area 70 within which the subordinate
windows 52 and dialogue boxes 64 may be
viewed.

[0031] In a preferred embodiment there are several
subordinate windows 52, including a video window 54, a
zoom window 56, a time-line window 58, one or more
encoder display windows 60, and one or more data
windows 62. The video window 54 displays a video frame
or a sequence of frames. For viewing a sequence of
frames, the frames may be stepped, viewed in real time,
viewed in slow motion or viewed in accelerated time.
Included are input controls accessible to the operator by
pointing and clicking, or by predefined key sequences.
There are stop, pause, play, back, forward, step and
other VCR-like controls for controlling the video
presentation in the video window 54. In some embodiments
there are scaling and scrolling controls also for the
video window 54.
[0032] The zoom window 56 displays a zoom view
of a portion of the video window 54 at a substantially
larger magnification than the video window. The portion
displayed in the zoom window 56 is automatically
determined according to the position of the pointing device
26 cursor. As the operator moves the pointing device
cursor, the zoom window scrolls to follow the cursor and
maintain the cursor within the zoom window 56. In a
preferred embodiment, the zoom window 56 supporting
software keeps the pointing device cursor approximately
centered. The purpose of the zoom window 56 is
to allow the operator to see down to the pixel level of the
video window 54. In doing so, an operator is able to click
on a very specific point of the video window 54. More
particularly, an operator can accurately place a boundary
point of an image object, so as to provide a semantically
accurate input for segmenting a video object. By
"semantically accurate" it is meant that a selected point
can be accurately placed at the image border of an
object image (e.g., the edge of a house, tree, hand, or
other image object being viewed).
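The automatic scrolling that keeps the cursor approximately centered can be sketched as a small rectangle calculation. The function name `zoom_source_rect`, the integer magnification factor, and the clamping at the frame edges are assumptions made for illustration; the source states only that the zoom view follows the cursor and keeps it roughly centered.

```python
def zoom_source_rect(cursor_x, cursor_y, frame_w, frame_h,
                     zoom_w, zoom_h, magnification):
    """Return (left, top, width, height) of the frame region to magnify.

    The region is centered on the cursor where possible and clamped so that
    it never extends past the frame, which is why the cursor is only
    approximately centered near the frame edges.
    """
    # Number of source pixels that fit in the zoom window at this magnification.
    src_w = max(1, min(frame_w, zoom_w // magnification))
    src_h = max(1, min(frame_h, zoom_h // magnification))
    # Center on the cursor, then clamp to the frame bounds.
    left = min(max(cursor_x - src_w // 2, 0), frame_w - src_w)
    top = min(max(cursor_y - src_h // 2, 0), frame_h - src_h)
    return left, top, src_w, src_h


# A 352x288 frame shown in a 256x256 zoom window at 8x magnification.
centered = zoom_source_rect(100, 100, 352, 288, 256, 256, 8)
at_corner = zoom_source_rect(351, 287, 352, 288, 256, 256, 8)
```

Recomputing this rectangle on every cursor-move event and redrawing the zoom window from it would give the scrolling behaviour described above.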

[0033] Referring to Fig. 4, the time-line window 58
includes an incremental time-line 72 of video frames,
along with zero or more thumbnail views 74 of select
video frames. The operator may click on any point along
the time-line 72 and the corresponding image frame is
displayed in the video window 54. The frames
corresponding to the thumbnail views 74 are selected
manually by the operator or automatically. The location and
number 73 of such frames are marked on the time-line
72.

[0034] The operator also can select a starting frame
and an ending frame to view a clip of the input video
sequence and define the processing range. Such selections
are highlighted along the time-line 72. In one
embodiment the line 76 designates the starting frame
and the line 78 designates the ending frame. The operator
selects the starting and ending points 76, 78 then
selects 'play' to play the video clip or 'segment' to track
objects.
[0035] The time-line window 58 also includes a
respective time-line 80 for each video object defined for
the input video sequence 11. A video object is defined
by outlining the object followed by segmenting the
object. The outlining provides a coarse user-selected
boundary of the object. The zoom window allows accurate
selection of points along such boundary. In one
embodiment there are two outlining modes. In one
mode the user draws the entire outline in a continuous
mouse motion by holding down a mouse button. In
another mode, the user clicks on various border points
of the object. Double-clicking is used to signal enclosing
of an object in the user-defined polygon. Editing options
allow the operator to move the entire outline, add additional
points between previously selected points, or
remove previously selected points. During segmentation,
the object boundary is refined to define the video
object more accurately.
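The three outline editing options named above (move the outline, add a point between existing points, remove a point) can be illustrated with a short Python sketch. Representing the outline as a list of (x, y) tuples and the function names used here are assumptions for illustration; the source does not describe the data structure behind the editing commands.

```python
def insert_point(outline, index, point):
    """Return a new outline with point added after outline[index]."""
    return outline[:index + 1] + [point] + outline[index + 1:]


def remove_point(outline, index):
    """Return a new outline without the point at position index."""
    return outline[:index] + outline[index + 1:]


def move_outline(outline, dx, dy):
    """Return a new outline with every point shifted by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in outline]


outline = [(0, 0), (4, 0), (4, 4)]
with_extra = insert_point(outline, 0, (2, 0))
without_second = remove_point(outline, 1)
shifted = move_outline(outline, 1, 2)
```

Each function returns a new list rather than modifying its input, which would make an undo feature straightforward, although the source does not mention one.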



[0036] As a video object is tracked from frame to 
frame using a tracking plug-in, the corresponding time- 
line 80 highlights the frames for which such video object 
has been identified, tracked, and extracted. In particu- 
lar, the video object data is extracted for each frame in 
which the object is tracked and stored with the video 
object manager. For example, the time-line window 
depicted in Fig. 4 shows a time-line 80a for a video 
object A and another time-line 80b for a video object B. 
As shown both objects were defined or tracked to the 
same starting frame denoted on time-line 72 by line 76. 
Video object A data was extracted all the way to the 
ending frame (note line 78) of the video clip. Markings 
82 provide a visual cue to convey such information to 
the operator. Video object B data was extracted for only 
a portion of the excerpt as depicted by the markings 84. 
The time-lines 72 and 80 also include a marker 81 
which indicates the current frame being viewed in the 
video window 54. 
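To draw markings such as 82 and 84, the set of frames for which an object has been extracted must be turned into highlighted spans along the object's time-line. A minimal sketch of that grouping follows; the function name `highlighted_ranges` is hypothetical, and the source does not say how the markings are computed.

```python
def highlighted_ranges(frames):
    """Group extracted frame indices into inclusive (first, last) spans."""
    ranges = []
    for frame in sorted(set(frames)):
        if ranges and frame == ranges[-1][1] + 1:
            ranges[-1][1] = frame  # extend the current span
        else:
            ranges.append([frame, frame])  # start a new span
    return [tuple(span) for span in ranges]


# Object extracted on frames 1-3, 7-8 and 10 of a clip.
spans = highlighted_ranges([3, 1, 2, 7, 8, 10])
```

An object extracted over the whole clip would yield a single span from the starting frame to the ending frame, while a partially tracked object would yield a shorter span or several spans.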

[0037] Referring to Fig. 5, a data window 62 is 
shown for a given video object. In one embodiment, the 
operator merely double clicks on the object name in the 
time-line window 58 or on the object in the video window 
54 (or enters some other predefined input), which 
causes the data window 62 for such video object to 
appear. The data window 62 includes user-input fields 
for an object title, translucent mask color, encoding tar- 
get bit rate, search range and other parameters for use 
in defining and encoding the corresponding video 
object. 

[0038] Referring to Fig. 6, during encoding an 
encoder progress window 86 is displayed. The encoder 
progress window 86 is one of the encoder displays 60, 
and shows the encoding status for each defined video 
object in the input video sequence 11. In one embodi- 
ment there is a respective information area 88 display- 
ing the number of encoding bits versus frame number 
and a peak signal to noise ratio (PSNR) versus frame 
number for each video object. In the display areas 88 
depicted, there is a set of bar graphs at each interval of 
select frame number intervals. One bar at each interval 
corresponds to the number of bits encoded, another 
corresponds to the overall PSNR of the frame's pixel 
data, another corresponds to the PSNR for the Y com- 
ponent of the frame's pixel data, another corresponds to 
the PSNR for the Cb color component of the frame's 
pixel data and a fourth corresponds to PSNR for the Cr 
color component of the frame's pixel data. One skilled in 
the art will appreciate that there are many formats in 
which such information and additional encoding infor- 
mation may be displayed to the operator. The bar 
graphs allow the operator to visualize how PSNR varies 
among video objects over a sequence of frames or how 
PSNR for a given component varies among video 
objects over a sequence of frames. In addition, by presenting
information for each of the Y, Cb and Cr components
separately, the operator can visualize how the total bit 
rate affects the PSNR of each component of an object. 
When the image quality or the amount of compression
is not satisfactory, the operator can view the graphs to
aid in determining which parameters to adjust to
achieve a more desirable balance between the PSNR
and bit rate results before running the encoder again.
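For reference, the PSNR values plotted in the encoder progress window can be computed per component as sketched below. The source does not give a formula; this sketch assumes the standard definition, PSNR = 10 log10(peak^2 / MSE), with 8-bit samples (peak 255), and the function name `psnr` and the flat-list sample layout are illustrative choices.

```python
import math


def psnr(original, decoded, peak=255):
    """Peak signal to noise ratio in dB between two equal-length sample lists."""
    if len(original) != len(decoded) or not original:
        raise ValueError("sample lists must be non-empty and of equal length")
    # Mean squared error between the original and decoded samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical data: PSNR is unbounded
    return 10 * math.log10(peak ** 2 / mse)


# One value per component, as in the separate Y, Cb and Cr bars.
frame_original = {"Y": [10, 20, 30, 40], "Cb": [128, 128], "Cr": [120, 130]}
frame_decoded = {"Y": [11, 19, 31, 39], "Cb": [128, 128], "Cr": [122, 128]}
component_psnr = {
    name: psnr(frame_original[name], frame_decoded[name])
    for name in frame_original
}
```

Plotting these values against frame number, alongside the bits spent per frame, gives the kind of display the paragraph above describes.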

Exemplary Plug-Ins 

[0039] In a preferred embodiment, there is a Video
Object Segmentation plug-in 16a, a Video Object Tracking
plug-in 16b and an MPEG-4 plug-in 16n. An exemplary
embodiment for each of the segmentation plug-in
16a and the tracking plug-in 16b is described in
commonly-assigned U.S. Patent Application Serial No.
09/323,501, filed June 1, 1999 for "Video Object Segmentation
Using Active Contour Modelling With Global
Relaxation," of Shijun Sun and Yongmin Kim, the content
of which is incorporated herein by reference and
made a part hereof.

[0040] The video object manager represents the
segmented video objects either as polygons or binary
masks. Polygons can be converted into binary masks by
scan conversion. Binary masks can be converted into
polygons by finding and linking the edges of the mask.
A plug-in may access the video object in either format.
The MPEG-4 plug-in is a conventional MPEG-4 encoder
such as that developed by Microsoft Corporation of
Redmond, Washington.
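
The polygon-to-mask direction of this conversion can be sketched as follows. This is only one possible scan-conversion scheme (an even-odd fill sampled at pixel centers), and the specification does not say which scheme the video object manager uses. The function name `polygon_to_mask` and the list-of-rows mask layout are assumptions made for the example.

```python
def polygon_to_mask(vertices, width, height):
    """Scan-convert a closed polygon into a binary mask (list of rows of 0/1).

    `vertices` is a list of (x, y) points; the last vertex is implicitly
    joined back to the first. A pixel is set when its center lies inside
    the polygon under the even-odd rule.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for y in range(height):
        yc = y + 0.5  # sample each scan line at the pixel centers
        crossings = []
        for i in range(n):
            (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
            # An edge contributes a crossing only if it straddles the scan line.
            if (y0 <= yc) != (y1 <= yc):
                crossings.append(x0 + (yc - y0) * (x1 - x0) / (y1 - y0))
        crossings.sort()
        # Fill between successive pairs of crossings.
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for x in range(width):
                if left <= x + 0.5 < right:
                    mask[y][x] = 1
    return mask
```

The reverse direction (mask to polygon) would require tracing and linking the mask's boundary pixels and is not shown here.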

Processing Scenario

[0041] Referring to Fig. 7, an exemplary processing
scenario 90 commences at step 92 with the operator
selecting a command to load in an input file. In one
embodiment a dialogue box opens with a list of files in a
select directory. The input file may be a still image or a
video sequence 11 in any of a variety of conventional or
proprietary formats. Once the file is loaded, the first
frame is displayed in the video window 54. In addition,
the time-line window 58 opens. If the file has predefined
video objects then the time-lines 80 for such objects
appear in the time-line window 58. If not, then just the
video sequence time-line 72 and thumb print views 74
are shown.

[0042] The operator may access the VCR-like controls
of the video window 54 to play back or move
around in the input video sequence 11. Alternatively, the
operator may click on a location along the time-line 72 to
select a frame to view in video window 54. Using one of
these procedures, at step 94 the operator moves to a
desired video frame. Consider the example where the
operator desires to track or extract an object within the
video sequence. For example, the video sequence 11
may include motion video of a person. The operator
may desire to extract the view of the person and apply
the view to a different background. Or, perhaps there is
an object that is moving quite fast and is not well-defined.
To perform some video process of extraction or
enhancement, a segmentation process and a tracking
process are to be performed on all or a portion of the
video sequence.

[0043] At step 96, the operator accesses the seg- 
mentation interface 44 (such as by accessing a set of 
commands from a pull-down menu). The operator 
selects a command to define a video object. At step 98, 
the operator then clicks on points at the peripheral edge 
of the desired object as shown in Fig. 8. In doing so, the 
operator is making a polygon of line segments. As pre- 
viously described, the zoom window 56 (see Fig. 9) 
allows the operator to better see the surrounding region 
near the location of a cursor 110 of a mouse or other 
pointing device. Referring to Fig. 9, the region near the 
cursor 110 is shown in the zoom window 56 with sufficient
precision to click on a selected pixel 112 of the
video window 54. In particular, each pixel of the video
window 54 in the vicinity of the pointer 110 is displayed
as a large block of pixels on the zoom window 56. Using
window 56 to guide the cursor movement in window 54,
the operator selects a pixel in window 54, and the
change is reflected in pixel block 112 of window 56. As
a result the operator is able to make a very accurate 
selection of object boundary points. Referring to Fig. 10, 
once the operator indicates that the selection of bound- 
ary points is complete the boundary is closed. Once the 
selection of points is complete, segmentation can occur. 
At step 100 a segmentation plug-in is activated. The 
segmentation plug-in receives the video frame data and 
the object points from the shell 14 and the segmentation 
plug-in API 18. The segmentation plug-in redefines the 
edges of the object in the video frame from being a 
large-segmented polygon to a more semantically-accu- 
rate object edge. An edge derivation process is per- 
formed as part of the segmentation process to estimate 
where there is a semantically-accurate edge (e.g., edge 
of man distinct from background). 
[0044] In defining the video object the operator is 
able to select a method of highlighting the object. 
Among the various methods is overlaying a translucent
mask 116 (see Fig. 10) which adds a color shade
to the video data for the object. For example, the operator
may select a color filter for a given object. Different
objects are assigned differing colors to allow the opera- 
tor to see the various video objects defined in a given 
video frame. Alternatively, a thick line can be selected to 
outline the object. Other methods include showing the 
video object in black and white or normal color while 
showing the background in the opposite, or making the 
background black, white or another pattern. In various 
embodiments any of many kinds of filtering operations 
can be performed to visually distinguish the video object 
from the portions of the video frame which are not part 
of one or more defined video objects. The video object 
itself is the original video pixel data. The overlaid
mask or the filtered alteration of displayed data serves 
as a visual cue for distinguishing the video object from 
the remaining portion of a video frame. 
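
One way such a translucent mask could be rendered is by alpha-blending a tint color over the masked pixels of a display copy, leaving the original pixel data untouched. The sketch below is illustrative only: the function name `overlay_mask`, the RGB-tuple frame layout, and the default blend factor are assumptions and are not taken from the described embodiment.

```python
def overlay_mask(frame, mask, tint, alpha=0.5):
    """Return a display copy of `frame` with `tint` blended over masked pixels.

    `frame` is a list of rows of (R, G, B) tuples, `mask` a matching list of
    rows of 0/1 values, and `alpha` the opacity of the tint (0 = invisible,
    1 = opaque). The input frame is not modified, so the video object itself
    remains the original pixel data.
    """
    out = []
    for row, mask_row in zip(frame, mask):
        out_row = []
        for pixel, inside in zip(row, mask_row):
            if inside:
                # Blend each color channel toward the tint color.
                pixel = tuple(round((1 - alpha) * p + alpha * t)
                              for p, t in zip(pixel, tint))
            out_row.append(pixel)
        out.append(out_row)
    return out
```

Assigning a different `tint` to each object would give the differing colors described above for distinguishing several objects in one frame.
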

[0045] Rather than have the operator go through
every frame manually and select object points, an auto- 
matic segmentation is achieved using a tracking plug-in. 
At step 102 an operator selects a starting video frame 
and an ending video frame. This is done in the time-line 
window 58. Time-line 72 displays a line 76 indicating the 
starting frame, a line 78 indicating the ending frame and 
a marker 81 indicating the location along the time-line of 
the frame currently displayed in video window 54. With 
the segmentation active and the tracking active, the 
operator plays the desired portion of the video 
sequence from the selected starting point 76 to the 
selected ending point 78 at step 104 to accomplish 
tracking of the video object(s). For each frame, the 
tracking plug-in receives the video frame data for a cur- 
rent frame, along with the object data from the prior 
frame. The tracking plug-in then identifies the object in
the current frame. In some embodiments the segmenta- 
tion plug-in automatically refines the object definition for 
such frame. In other embodiments, segmentation is per- 
formed only when specified by the operator. For exam- 
ple, the operator may see the tracking program is 
starting to include additional pixels or exclude pixels of 
the semantically correct object. As a result, the operator
goes to the frame where such tracking error com- 
mences. The operator can now redefine the object or 
alternatively define the video object into a set of video 
sub-objects. Each sub-object then is tracked using dif- 
ferent segmentation and tracking plug-ins from that 
point to the end point 78. Another use for defining video 
sub-objects is where there is a hole or transparent part 
of the video object which shows a moving background. 
By excluding the inner hole portion, a sub-object repre- 
senting the rest of the imaged object without the hole is 
tracked as a video sub-object. 
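
The frame-by-frame hand-off described in this paragraph can be sketched as a simple loop. The names `track_object`, `tracker` and `refine` are hypothetical stand-ins for the tracking and segmentation plug-ins, and the sketch assumes the object has already been defined on the starting frame; it does not reflect the plug-ins' real interfaces.

```python
def track_object(frames, start, end, initial_object, tracker, refine=None):
    """Propagate an object definition from frame `start` through frame `end`.

    `tracker(frame, previous_object)` stands in for the tracking plug-in and
    returns the object as found in `frame`. `refine(frame, obj)`, if given,
    stands in for an automatic segmentation pass on each tracked frame.
    Returns a dict mapping frame index to object data.
    """
    objects = {start: initial_object}
    previous = initial_object
    for index in range(start + 1, end + 1):
        # The tracker sees the current frame plus the prior frame's object.
        current = tracker(frames[index], previous)
        if refine is not None:
            current = refine(frames[index], current)
        objects[index] = current
        previous = current
    return objects
```

If tracking drifts at some frame, the same function could be called again from that frame with a redefined object, or once per sub-object with a different `tracker` for each.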

[0046] Once the video object has been accurately 
tracked to the operator's satisfaction, the operator can 
perform image processing enhancement or another 
desired function as available from one of the plug-ins 
16. In one example, the operator may save just the 
video object in a sequence (e.g., a sequence of a man 
extracted from the input video sequence, where the 
man is shown against some predefined or other back- 
ground.) In such example, the video sub-objects form- 
ing an object are combined with the aggregate video 
object being saved as a separate video sequence. The 
separate sequence may stand alone (e.g., object on
some predefined background) or be combined with 
another video clip, in effect overlaying the video object 
onto another video sequence. 

[0047] At some point, the operator makes a deter- 
mination to save a video clip as an encoded output file. 
Such encoded output file may become an input video 
sequence at a later time, or may be a canned clip 
exported for viewing on a display device outside the 
processing environment 10. In a preferred embodiment 
an MPEG-4 encoder is included as a plug-in. At step 
106, the operator selects the MPEG-4 encoder to compress
a video clip and create an output file. Unless previously
combined, any video sub-objects which together
form a video object are combined into an aggregate 
video object prior to encoding. 

[0048] As previously described, an encoding
progress display 86 allows the operator to analyze the
output quality by viewing the peak signal to noise ratio
per component or per number of bits used in encoding.
In addition, the operator can alter some encoding
parameters, such as bit rate, motion vector search
range and fidelity of encoded shape. The operator can
view the results for many different encodings to find the
encoding settings that provide the desired trade-off to
achieve a satisfactory image quality at some number of
bits encoded per pixel.

Meritorious and Advantageous Effects 

[0049] According to an advantage of this invention,
various processing needs can be met using differing
plug-ins. According to another advantage of the invention,
the processing shell provides isolation between the
user interface and the plug-ins. Plug-ins do not directly
access the video encoder. The plug-ins accomplish
segmentation or tracking or another task by interfacing
through an API (application program interface) module.
For example, a segmentation plug-in defines an object
and stores the pertinent data in a video object manager
portion of the shell. The encoder retrieves the video
objects from the video object manager. Similarly,
plug-ins do not directly draw segmentations on the screen,
but store them in a central location. A graphical user
interface module of the user interface retrieves the data
from the central location and draws the objects in the video
window. As a result, the various plug-ins are insulated
from the intricacies of reading various file formats. Thus,
data can even be captured from a camcorder or downloaded
over a network through the user interface and
shell, without regard for plug-in compatibilities.
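
The isolation described in this paragraph can be pictured with a small sketch in which plug-ins only receive frame data and return results, while the shell alone writes to the central store that the encoder and the user interface read from. The class and method names (`VideoObjectManager`, `Shell`, `register`, `run`) are hypothetical and do not represent the actual API of the described shell.

```python
class VideoObjectManager:
    """Central store of segmented objects, keyed by object name and frame."""

    def __init__(self):
        self._objects = {}

    def store(self, name, frame_index, data):
        self._objects.setdefault(name, {})[frame_index] = data

    def get(self, name, frame_index):
        return self._objects[name][frame_index]

    def names(self):
        return sorted(self._objects)


class Shell:
    """Hypothetical shell: decodes input elsewhere, hands plug-ins raw frames."""

    def __init__(self):
        self.manager = VideoObjectManager()
        self.plugins = {}

    def register(self, plugin_name, plugin):
        # A plug-in here is any callable taking frame data plus options.
        self.plugins[plugin_name] = plugin

    def run(self, plugin_name, object_name, frame_index, frame, **options):
        # The plug-in never touches files, the screen, or the encoder;
        # its result is stored centrally for other modules to retrieve.
        result = self.plugins[plugin_name](frame, **options)
        self.manager.store(object_name, frame_index, result)
        return result
```

Under this arrangement an encoder or drawing routine would call only `manager.get(...)`, so a new plug-in could be added without changes to either.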

[0050] An advantage of the automatically scrolling
zoom window is that the operator may view a location
within the video frame, while also viewing a close-up of
such location in the zoom window. This allows the operator
to precisely place a point on a semantically-correct
border of the object (e.g., at the border of an object
being depicted in video).
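
A possible scrolling rule for such a zoom window is sketched below: the zoomed region moves only as far as needed to keep the cursor inside it, and never past the edge of the frame. The function name `scroll_zoom_origin` and the choice of minimal scrolling (rather than, say, always centering on the cursor) are assumptions made for the example.

```python
def scroll_zoom_origin(origin, cursor, view_size, frame_size):
    """Return the top-left source pixel of the zoomed region after scrolling.

    `origin` is the current (x, y) top-left pixel of the region shown in the
    zoom window, `cursor` the cursor position in frame coordinates,
    `view_size` the (width, height) of the region in source pixels, and
    `frame_size` the (width, height) of the video frame.
    """
    new_origin = []
    for o, c, v, f in zip(origin, cursor, view_size, frame_size):
        if c < o:
            o = c              # cursor left the near edge: scroll back
        elif c >= o + v:
            o = c - v + 1      # cursor left the far edge: scroll forward
        # Keep the region inside the frame.
        o = max(0, min(o, f - v))
        new_origin.append(o)
    return tuple(new_origin)
```

Each source pixel in the returned region would then be drawn as a block of screen pixels in the zoom window.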

[0051] An advantage of the encoding progress display
is that the operator is able to visualize how peak
signal to noise ratio varies between video objects over a
sequence of frames or how the total number of bits
affects the peak signal to noise ratio of each component
of an object. When the image quality is unsatisfactory,
these displays enable the operator to identify a parameter
in need of adjusting to balance peak signal to noise
ratio and the bit rate.

[0052] Although a preferred embodiment of the 
invention has been illustrated and described, various 
alternatives, modifications and equivalents may be 
used. Therefore, the foregoing description should not be
taken as limiting the scope of the inventions which are 
defined by the appended claims. 

Claims 

1. A system including a display, an input device, a
processor and a memory storage media and
embodying an interactive video processing environment,
the system further comprising:

code means for generating display content in a
main display window of the interactive video
processing environment;

code means for generating display content,
pertaining to a segmented video object of an
image, in a first subordinate window displayed
within the main window; and

code means for generating display content,
pertaining to an image encoding process, in a
second subordinate window concurrently displayed
with the first subordinate window within
the main window.

2. The system of claim 1, further comprising:

code means for generating display content in 
the first subordinate window for tracking the 
segmented video object among a plurality of 
video frames. 

3. The system of claim 1, in which the code means for
generating display content in the first subordinate
window generates an image in the first subordinate
window, the image including the segmented video
object, and in which the code means for generating
display content in the second subordinate window
generates display content including encoding control
parameters for encoding the image.

4. The system of claim 3, in which the code means for
generating display content in the second subordinate
window generates display content including a
desired bit rate and a desired intra-coding period.

5. The system of claim 1, in which the code means for
generating display content in the second subordinate
window generates display content including
encoding control parameters for encoding the segmented
video object across a plurality of video
frames.

6. The system of claim 1, in which the code means for
generating display content in the second subordinate
window generates display content including
encoding status information for an encoding of a
plurality of segmented video objects across a plurality
of video frames.



7. The system of claim 6, in which the display content 
in the second subordinate window comprises a bit 
rate for an encoding of each one of the plurality of 
segmented video objects. 

8. The system of claim 6, in which the display content 
in the second subordinate window comprises a sig- 
nal to noise ratio for an encoding of each one of the 
plurality of segmented video objects. 

9. The system of claim 6, in which the second subor- 
dinate window comprises a plurality of display 
graphs, in which there is a one to one correspond- 
ence between respective ones of the plurality of 
display graphs and each one of the plurality of seg- 
mented video objects. 

10. The system of claim 6, further comprising: 

code means for logically combining the plurality 
of segmented video objects into a combined 
object, and wherein the code means for gener- 
ating display content pertaining to said encod- 
ing pertains to an encoding of the combined 
object. 

11. The system of claim 1, in which the input device 
comprises a pointing device, the system displaying 
a plurality of first subordinate windows concurrently 
in the main window, a first one of the plurality of first 
subordinate windows displaying a video frame, a 
second one of the plurality of first subordinate win- 
dows displaying a zoomed in area of the video 
frame currently displayed in the first one, wherein a 
cursor of the pointing device is concurrently dis- 
played in the first one and the second one of the 
plurality of first subordinate windows, and wherein 
the second one automatically scrolls to maintain the 
pointing device cursor in view. 

12. The system of claim 11, in which the code means 
for generating display content pertaining to the seg- 
mented video object generates an outline in 
response to user-selected points in the first one of 
the plurality of first subordinate windows. 

13. The system of claim 11, in which the video frame 
displayed in the first one of the plurality of first sub- 
ordinate windows comprises video content includ- 
ing the segmented video object and other video 
content, in which the code means for generating 
display content pertaining to the segmented video 
object generates a translucent object which over- 
lays the segmented video object, the translucent 
object distinguishing the segmented video object 
from the other video content in the first one of the 
plurality of first subordinate windows. 

14. The system of claim 1, in which the input device
comprises a pointing device, the system displaying 
a plurality of first subordinate windows concurrently 
in the main window, a first one of the plurality of first 
subordinate windows displaying a time sequence of 
video frames, a second one of the plurality of first 
subordinate windows displaying a time line, 
wherein the code means for generating display con- 
tent pertaining to the segmented video object gen- 
erates a marking along the time-line for each video 
frame that has been segmented to define the seg- 
mented video object. 

15. A digital processor readable storage medium for 
storing processor-executable instructions and proc- 
essor-accessible data for maintaining an interactive 
video processing environment of display windows 
in response to user inputs, the medium comprising: 

code means for generating display content in a 
main display window of the interactive video 
processing environment; 

code means for generating display content per- 
taining to a segmented video object in a first 
subordinate window displayed within the main 
window; and 

code means for generating display content, 
pertaining to an image encoding process, in a 
second subordinate window concurrently dis- 
played with the first subordinate window within 
the main window. 

16. The medium of claim 15 further comprising: 

code means for generating display content in 
the first subordinate window for tracking the 
segmented video object among a plurality of 
video frames. 

17. The medium of claim 15, in which the code means 
for generating display content in the first subordi- 
nate window generates an image in the first subor- 
dinate window, the image including the segmented 
video object, and in which the code means for gen- 
erating display content in the second subordinate 
window generates display content including encod- 
ing control parameters for encoding the image. 

18. The medium of claim 15, in which the code means 
for generating display content in the second subor- 
dinate window comprises means for generating dis- 
play content including a desired bit rate and a 
desired intra-coding period. 

19. The medium of claim 15, in which the code means
for generating display content in the second subordinate
window comprises code means for generating
display content including encoding control
parameters for encoding the segmented video
object across a plurality of video frames.

20. The medium of claim 15, in which the code means
for generating display content in the second subordinate
window comprises code means for generating
display content including encoding status
information for an encoding of a plurality of segmented
video objects across a plurality of video
frames.

21. The medium of claim 20, in which the encoding status
information display content generating code
means comprises means for generating a bit rate
for an encoding of each one of the plurality of segmented
video objects.

22. The medium of claim 20, in which the encoding status
information display content generating code
means comprises means for generating a signal to
noise ratio for an encoding of each one of the plurality
of segmented video objects.

23. The medium of claim 20, in which the second subordinate
window comprises a plurality of display
graphs, in which there is a one to one correspondence
between respective ones of the plurality of
display graphs and ones of the plurality of segmented
video objects.

24. The medium of claim 20, further comprising:

code means for logically combining the plurality
of segmented video objects into a combined
object, and in which the code means for generating
display content in the second subordinate
window generates display content pertaining to
an encoding of the combined object.

25. The medium of claim 15, further comprising:

code means for generating a plurality of first
subordinate windows concurrently in the main
window, a first one of the plurality of first subordinate
windows displaying a video frame, a
second one of the plurality of first subordinate
windows displaying a zoomed in area of the
video frame currently displayed in the first one,
wherein a cursor of a pointing device is concurrently
displayed in the first one and the second
one of the plurality of first subordinate windows,
and wherein the second one automatically
scrolls to maintain the pointing device
cursor in view.

26. The medium of claim 25, in which the code means
for generating display content pertaining to the segmented
video object generates an outline in
response to user-selected points in the first one of
the plurality of first subordinate windows.

27. The medium of claim 25, in which the video frame
displayed in the first one of the plurality of first subordinate
windows comprises video content including
the segmented video object and other video
content, in which the code means for generating
display content pertaining to the segmented video
object generates a translucent object which overlays
the segmented video object, the translucent
object distinguishing the segmented video object
from the other video content in the first one of the
plurality of first subordinate windows.

28. The medium of claim 15, further comprising:

code means for generating a plurality of first
subordinate windows concurrently in the main
window, a first one of the plurality of first subordinate
windows displaying a time sequence of
video frames, a second one of the plurality of
first subordinate windows displaying a time
line, wherein the code means for generating
display content pertaining to the segmented
video object generates a marking along the
time-line for each video frame that has been
segmented to define the segmented video
object.

29. A method for interactively processing a video
sequence on a system having a display, an input
device, and a processor, the method comprising the
steps of:

generating a main display window of a video
processing environment;

generating a first subordinate window within
the main display window for displaying a motion
video sequence of video frames;

responding to a user input to hold the motion
video sequence at a select video frame;

outlining a boundary of a video object;

segmenting the outlined video object;

playing at least a portion of the motion video
sequence, during which the video object is
tracked to define the segmented video object
among a plurality of frames of the motion video
sequence;

selectively generating a second subordinate
window within the main display window which
is concurrently active with the first subordinate
window, the second subordinate window displaying
information pertaining to an encoding
process of said plurality of frames of the motion
video sequence; and

encoding said plurality of video frames.



30. The method of claim 29, further comprising the 
steps of: 

selectively generating a third subordinate win- 
dow within the main display which is concur- 
rently active with the first subordinate window, 
the third subordinate window displaying a 
zoomed in area of the first subordinate window, 
wherein a cursor of the input device is concur- 
rently displayed in the first subordinate window 
and the third subordinate window; 
automatically scrolling the third subordinate 
window to maintain the pointing device cursor 
in view during the step of outlining the bound- 
ary of the video object. 

31. The method of claim 29, further comprising the step
of:

overlaying a translucent object onto the segmented
video object.

32. The method of claim 29, further comprising the 
steps of: 

selectively generating a third subordinate win- 
dow within the main display which is concur- 
rently active with the first subordinate window, 
the third subordinate window displaying a time 
line; 

generating a marking along the time-line for 
each video frame in which the segmented 
video object has been tracked. 

33. The method of claim 29, in which a plurality of video 
objects are segmented and tracked, and further 
comprising the steps of: 

selectively generating a third subordinate win- 
dow within the main display which is concur- 
rently active with the first subordinate window, 
the third subordinate window displaying a 
respective time line for each one of the plurality 
of video objects; 

for each respective time-line, generating a
marking along said respective time-line for 
each video frame in which a corresponding one 
of the plurality of video objects has been 
tracked. 

34. The method of claim 29, in which a plurality of video 
objects are segmented and tracked, and further 
comprising the step of: 

logically combining the plurality of objects into 
a combined object prior to the step for encod- 
ing, and wherein the plurality of objects are 
treated as the combined object during the 
encoding step.

35. The method of claim 29, in which a plurality of
video objects are segmented and tracked, and further
comprising the step of:

generating display content in the second subordinate
window pertaining to an encoding of a
plurality of segmented video objects, said display
content comprising a bit rate for the encoding
of each one of the plurality of segmented
video objects.

36. The method of claim 29, in which a plurality of
video objects are segmented and tracked, and further
comprising the step of:

generating display content in the second subordinate
window pertaining to an encoding of a
plurality of segmented video objects, said display
content comprising a signal to noise ratio
for the encoding of each one of the plurality of
segmented video objects.

37. The method of claim 29, in which a plurality of video
objects are segmented and tracked, and further
comprising the step of:

generating a display graph in the second subordinate
window for each one of the segmented
video objects.